INN Hotels Project¶

Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, etc. Canceling is often made easier by the option to do so free of charge, or preferably at a low cost, which is beneficial to hotel guests but a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

  • Loss of resources (revenue) when the hotel cannot resell the room.
  • Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  • Lowering prices last minute so the hotel can resell the room, which reduces the profit margin.
  • Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group, which operates a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are likely to be canceled, and help formulate profitable cancellation and refund policies.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.
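Because the analysis below relies on all 19 of these columns being present, it can help to verify a freshly loaded file against this dictionary before starting. A minimal sketch (the `check_schema` helper and the toy frame are illustrative, not part of the project code):

```python
import pandas as pd

# The 19 columns listed in the data dictionary above
EXPECTED_COLUMNS = [
    "Booking_ID", "no_of_adults", "no_of_children", "no_of_weekend_nights",
    "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space",
    "room_type_reserved", "lead_time", "arrival_year", "arrival_month",
    "arrival_date", "market_segment_type", "repeated_guest",
    "no_of_previous_cancellations", "no_of_previous_bookings_not_canceled",
    "avg_price_per_room", "no_of_special_requests", "booking_status",
]

def check_schema(df: pd.DataFrame) -> list:
    """Return the dictionary columns missing from the loaded frame."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]

# Toy frame carrying only two of the expected columns
toy = pd.DataFrame({"Booking_ID": ["INN00001"], "no_of_adults": [2]})
print(check_schema(toy))  # lists the 17 columns the toy frame lacks
```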

Importing necessary libraries and data¶

In [118]:
# To filter the warnings
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split


# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score, roc_curve, confusion_matrix, roc_auc_score, f1_score,
    precision_score, recall_score, precision_recall_curve, make_scorer,
)

# To get decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get grid search cv
from sklearn.model_selection import GridSearchCV

Data Overview¶

  • Observations
  • Sanity checks
In [2]:
# Loading the dataset

from google.colab import drive
drive.mount('/content/drive')

# read the data
path="/content/drive/MyDrive/Data Science/INNHotelsGroup.csv"
innhotel_df = pd.read_csv(path)
# returns the first 5 rows
innhotel_df.head()
Mounted at /content/drive
Out[2]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50 0 Canceled
In [3]:
# Shape of the data set
innhotel_df.shape
Out[3]:
(36275, 19)

The INN Hotels dataset has 36275 rows and 19 columns.

In [4]:
# Data type and null count of the dataset
innhotel_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB

Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, booking_status: object datatype

no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, lead_time, arrival_year, arrival_month, arrival_date, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, no_of_special_requests: int datatype

avg_price_per_room: float datatype

In [5]:
#  Getting missing values in the dataset
innhotel_df.isna().sum()
Out[5]:
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

There are no null or missing values in the dataset

In [6]:
# Checking duplicates
innhotel_df.duplicated().sum()
Out[6]:
0

There are no duplicates in the dataset

In [7]:
# Creating copy of the dataset
innhotel_df2 = innhotel_df.copy()
In [8]:
# Deleting booking ID column
innhotel_df.drop('Booking_ID',axis=1,inplace=True)
# Summary of the dataset
innhotel_df.describe(include='all').T
Out[8]:
count unique top freq mean std min 25% 50% 75% max
no_of_adults 36275.0 NaN NaN NaN 1.844962 0.518715 0.0 2.0 2.0 2.0 4.0
no_of_children 36275.0 NaN NaN NaN 0.105279 0.402648 0.0 0.0 0.0 0.0 10.0
no_of_weekend_nights 36275.0 NaN NaN NaN 0.810724 0.870644 0.0 0.0 1.0 2.0 7.0
no_of_week_nights 36275.0 NaN NaN NaN 2.2043 1.410905 0.0 1.0 2.0 3.0 17.0
type_of_meal_plan 36275 4 Meal Plan 1 27835 NaN NaN NaN NaN NaN NaN NaN
required_car_parking_space 36275.0 NaN NaN NaN 0.030986 0.173281 0.0 0.0 0.0 0.0 1.0
room_type_reserved 36275 7 Room_Type 1 28130 NaN NaN NaN NaN NaN NaN NaN
lead_time 36275.0 NaN NaN NaN 85.232557 85.930817 0.0 17.0 57.0 126.0 443.0
arrival_year 36275.0 NaN NaN NaN 2017.820427 0.383836 2017.0 2018.0 2018.0 2018.0 2018.0
arrival_month 36275.0 NaN NaN NaN 7.423653 3.069894 1.0 5.0 8.0 10.0 12.0
arrival_date 36275.0 NaN NaN NaN 15.596995 8.740447 1.0 8.0 16.0 23.0 31.0
market_segment_type 36275 5 Online 23214 NaN NaN NaN NaN NaN NaN NaN
repeated_guest 36275.0 NaN NaN NaN 0.025637 0.158053 0.0 0.0 0.0 0.0 1.0
no_of_previous_cancellations 36275.0 NaN NaN NaN 0.023349 0.368331 0.0 0.0 0.0 0.0 13.0
no_of_previous_bookings_not_canceled 36275.0 NaN NaN NaN 0.153411 1.754171 0.0 0.0 0.0 0.0 58.0
avg_price_per_room 36275.0 NaN NaN NaN 103.423539 35.089424 0.0 80.3 99.45 120.0 540.0
no_of_special_requests 36275.0 NaN NaN NaN 0.619655 0.786236 0.0 0.0 0.0 1.0 5.0
booking_status 36275 2 Not_Canceled 24390 NaN NaN NaN NaN NaN NaN NaN
In [9]:
# Checking different counts of categorical variables in dataset
for i in innhotel_df.columns:
    if innhotel_df[i].dtypes == object:
        print('There are',innhotel_df[i].nunique(),'Different type of', i ,'for the Innhotel and its counts in dataset as below')
        display(innhotel_df[i].value_counts(normalize=True))
        display(innhotel_df[i].value_counts())
        print('----------------------------------------------------------------------------')
There are 4 Different type of type_of_meal_plan for the Innhotel and its counts in dataset as below
Meal Plan 1     0.767333
Not Selected    0.141420
Meal Plan 2     0.091110
Meal Plan 3     0.000138
Name: type_of_meal_plan, dtype: float64
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
----------------------------------------------------------------------------
There are 7 Different type of room_type_reserved for the Innhotel and its counts in dataset as below
Room_Type 1    0.775465
Room_Type 4    0.166975
Room_Type 6    0.026630
Room_Type 2    0.019076
Room_Type 5    0.007305
Room_Type 7    0.004356
Room_Type 3    0.000193
Name: room_type_reserved, dtype: float64
Room_Type 1    28130
Room_Type 4     6057
Room_Type 6      966
Room_Type 2      692
Room_Type 5      265
Room_Type 7      158
Room_Type 3        7
Name: room_type_reserved, dtype: int64
----------------------------------------------------------------------------
There are 5 Different type of market_segment_type for the Innhotel and its counts in dataset as below
Online           0.639945
Offline          0.290227
Corporate        0.055603
Complementary    0.010779
Aviation         0.003446
Name: market_segment_type, dtype: float64
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: market_segment_type, dtype: int64
----------------------------------------------------------------------------
There are 2 Different type of booking_status for the Innhotel and its counts in dataset as below
Not_Canceled    0.672364
Canceled        0.327636
Name: booking_status, dtype: float64
Not_Canceled    24390
Canceled        11885
Name: booking_status, dtype: int64
----------------------------------------------------------------------------

Observation from data¶

There are 36275 rows and 19 columns in the dataset.

  1. Booking_ID: a unique ID assigned to each reservation. There are 36275 booking IDs and no missing values. Object datatype.
  2. no_of_adults: int datatype, no missing values. Mean and median are about 2. Minimum value is 0 and maximum is 4.
  3. no_of_children: int datatype, no missing values. Mean and median are almost 0. Minimum value is 0 and maximum is 10.
  4. no_of_weekend_nights: int datatype, no missing values. Mean is 0.81 and median is 1. Minimum value is 0 and maximum is 7.
  5. no_of_week_nights: int datatype, no missing values. Mean is 2.2 and median is 2. Minimum value is 0 and maximum is 17.
  6. type_of_meal_plan: 4 unique meal plans. About 77% of customers selected Meal Plan 1, followed by 14% who selected no plan. Object datatype, no missing values.
  7. required_car_parking_space: int datatype, no missing values. Mean, median, and minimum are 0; maximum is 1.
  8. room_type_reserved: 7 unique room types. About 77.5% of customers selected Room_Type 1, followed by 16.7% who selected Room_Type 4. Object datatype, no missing values.
  9. lead_time: int datatype, no missing values. Mean is 85.23 days and median is 57 days. Minimum value is 0 and maximum is 443 days.
  10. arrival_year: int datatype. About 82% of the data is from 2018 and the rest from 2017.
  11. arrival_month: int datatype, no missing values. Mean is month 7.4 and median is month 8. Minimum is month 1 and maximum is month 12.
  12. market_segment_type: 5 unique market segments. 64% are Online, followed by 29% Offline. Object datatype, no missing values.
  13. repeated_guest: int datatype, no missing values. Most values are 0, meaning most guests are not repeat guests.
  14. no_of_previous_cancellations: int datatype, no missing values. Most values are 0; the maximum number of previous cancellations is 13.
  15. no_of_previous_bookings_not_canceled: int datatype, no missing values. Most values are 0; the maximum is 58.
  16. avg_price_per_room: float datatype, no missing values. Minimum is 0 and maximum is 540 euros. Mean is 103.42 euros and median is 99.45 euros.
  17. no_of_special_requests: int datatype, no missing values. Minimum is 0 and maximum is 5. Mean is 0.62 and median is 0.
  18. booking_status: 2 unique values. 67.24% of bookings were not canceled and 32.76% were canceled.
In [10]:
# Getting percent on bar plot
def barplot_values_percent(ax):
    heightlst = []
    for i in ax.patches:
        heightlst.append(i.get_height())
    total = sum(heightlst)

    for i in ax.patches:
        x = i.get_x()+0.05 #adjust the numbers (higher numbers = to the right, lower = to the left)
        height = i.get_height()+0.1 #adjust the numbers (higher numbers = up, lower = down)
        value = ("{0:.2f}".format((i.get_height()/total)*100)+'%')

        ax.text(x, height, value, fontsize=10,color='red')
In [11]:
# Getting median on box plot
import matplotlib.patheffects as path_effects

def add_median_labels(ax, fmt='.1f'):
    lines = ax.get_lines()
    boxes = [c for c in ax.get_children() if type(c).__name__ == 'PathPatch']
    lines_per_box = int(len(lines) / len(boxes))
    for median in lines[4:len(lines):lines_per_box]:
        x, y = (data.mean() for data in median.get_data())
        # choose value depending on horizontal or vertical plot orientation
        value = x if (median.get_xdata()[1] - median.get_xdata()[0]) == 0 else y
        text = ax.text(x, y, f'{value:{fmt}}', ha='center', va='center',
                       fontweight='bold', color='white')
        # create median-colored border around white text for contrast
        text.set_path_effects([
            path_effects.Stroke(linewidth=3, foreground=median.get_color()),
            path_effects.Normal(),
        ])
In [12]:
# Histplot,boxplot and count plot function
def hist(fea,df,kde):
    sns.histplot(x=fea,data=df,kde=kde)
    plt.show()
def box(fea,df):
    ax = sns.boxplot(x=fea,data=df)
    add_median_labels(ax)
    plt.show()
def count(fea,df):
    ax = sns.countplot(x=fea,data=df)
    barplot_values_percent(ax)
    plt.show()

Exploratory Data Analysis (EDA)¶

Univariate analysis of data¶

In [13]:
# Univariate analysis for no of adults
hist('no_of_adults',innhotel_df,False)
box('no_of_adults',innhotel_df)
display(innhotel_df['no_of_adults'].value_counts(normalize=True))
2    0.719724
1    0.212130
3    0.063873
0    0.003832
4    0.000441
Name: no_of_adults, dtype: float64

About 72% of reservations are for 2 adults, followed by 21.21% for 1 adult and 6.39% for 3 adults. A few reservations have 0 or 4 adults.

In [14]:
# Univariate analysis for no of children
hist('no_of_children',innhotel_df,False)
box('no_of_children',innhotel_df)
display(innhotel_df['no_of_children'].value_counts(normalize=True))
0     0.925624
1     0.044604
2     0.029166
3     0.000524
9     0.000055
10    0.000028
Name: no_of_children, dtype: float64

About 92.56% of reservations include no children and 4.46% include 1 child. The maximum number of children in a reservation is 10.

In [15]:
# Univariate analysis for no_of_weekend_nights
hist('no_of_weekend_nights',innhotel_df,False)
box('no_of_weekend_nights',innhotel_df)
display(innhotel_df['no_of_weekend_nights'].value_counts(normalize=True))
0    0.465114
1    0.275534
2    0.250062
3    0.004218
4    0.003556
5    0.000937
6    0.000551
7    0.000028
Name: no_of_weekend_nights, dtype: float64

About 46.51% of reservations include no weekend nights, 27.55% include 1, and 25% include 2. The maximum number of weekend nights is 7.

In [16]:
# Univariate analysis for no_of_week_nights
hist('no_of_week_nights',innhotel_df,False)
box('no_of_week_nights',innhotel_df)
display(innhotel_df['no_of_week_nights'].value_counts(normalize=True))
2     0.315479
1     0.261558
3     0.216099
4     0.082426
0     0.065803
5     0.044493
6     0.005210
7     0.003115
10    0.001709
8     0.001709
9     0.000937
11    0.000469
15    0.000276
12    0.000248
14    0.000193
13    0.000138
17    0.000083
16    0.000055
Name: no_of_week_nights, dtype: float64

About 31.55% of reservations include 2 week nights, followed by 26.16% with 1 week night. The maximum number of week nights is 17.

In [17]:
# Univariate analysis for type_of_meal_plan
count('type_of_meal_plan',innhotel_df)

About 76.73% of reservations selected Meal Plan 1, 14.14% selected no meal plan, 9.11% selected Meal Plan 2, and only about 0.01% selected Meal Plan 3.

In [18]:
# Univariate analysis for required_car_parking_space
hist('required_car_parking_space',innhotel_df,False)
box('required_car_parking_space',innhotel_df)
display(innhotel_df['required_car_parking_space'].value_counts(normalize=True))
0    0.969014
1    0.030986
Name: required_car_parking_space, dtype: float64

About 96.9% of reservations do not require a car parking space; only 3.1% do.

In [19]:
# Univariate analysis for room_type_reserved
plt.figure(figsize=(12, 7))
count('room_type_reserved',innhotel_df)

About 77.55% of reservations are for Room_Type 1, followed by 16.7% for Room_Type 4. The remaining five room types together account for less than 6%.

In [20]:
# Univariate analysis for lead_time
hist('lead_time',innhotel_df,True)
box('lead_time',innhotel_df)

More than 5,000 bookings are made with little or no advance notice. The mean lead time is 85.23 days and the median is 57 days; some reservations are made more than 400 days in advance. The distribution is right-skewed.

In [21]:
# Univariate analysis for arrival_year
hist('arrival_year',innhotel_df,False)
box('arrival_year',innhotel_df)
display(innhotel_df['arrival_year'].value_counts(normalize=True))
2018    0.820427
2017    0.179573
Name: arrival_year, dtype: float64

About 82% of the data is from 2018 and 17.96% from 2017.

What are the busiest months in the hotel?

In [22]:
hist('arrival_month',innhotel_df,False)
In [23]:
# Univariate analysis for arrival_month for year 2018
hist('arrival_month',innhotel_df[innhotel_df['arrival_year'] == 2018],False)
box('arrival_month',innhotel_df)

Most arrivals in 2018 fall between the 6th and 10th months; very few reservations are in the first quarter of the year.

In [24]:
# Univariate analysis for arrival_date
hist('arrival_date',innhotel_df,False)
box('arrival_date',innhotel_df)

Reservations are spread fairly evenly across the days of the month.

Which market segment do most of the guests come from?

In [25]:
# Univariate analysis for market_segment_type
count('market_segment_type',innhotel_df)

About 64% of reservations come from the Online market segment, followed by 29% from Offline. The Corporate segment accounts for about 5.56%.

In [26]:
# Univariate analysis for repeated_guest
hist('repeated_guest',innhotel_df,False)
box('repeated_guest',innhotel_df)
display(innhotel_df['repeated_guest'].value_counts(normalize=True))
0    0.974363
1    0.025637
Name: repeated_guest, dtype: float64

Only 2.56% of reservations are made by repeat guests; most come from new guests.

In [27]:
# Univariate analysis for no_of_previous_cancellations
hist('no_of_previous_cancellations',innhotel_df,False)
box('no_of_previous_cancellations',innhotel_df)
display(innhotel_df['no_of_previous_cancellations'].value_counts(normalize=True))
0     0.990682
1     0.005458
2     0.001268
3     0.001185
11    0.000689
5     0.000303
4     0.000276
13    0.000110
6     0.000028
Name: no_of_previous_cancellations, dtype: float64

About 99.07% of reservations have no previous cancellations (or are from new customers). The maximum number of previous cancellations is 13.

In [28]:
# Univariate analysis for no_of_previous_booking_not_cancelled
hist('no_of_previous_bookings_not_canceled',innhotel_df,False)
box('no_of_previous_bookings_not_canceled',innhotel_df)

About 99% of bookings have a value of 0; the maximum is 58.

In [29]:
# Univariate analysis for avg_price_per_room
plt.figure(figsize=(12, 7))
hist('avg_price_per_room',innhotel_df,True)
box('avg_price_per_room',innhotel_df)

The minimum avg_price_per_room is 0 and the maximum is 540 euros. The mean is 103.42 euros and the median is 99.45 euros. The distribution is roughly normal, but the 0-euro prices need to be treated.
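One possible treatment of the 0-euro prices, sketched on toy data (the values and the median-imputation choice are illustrative, not the notebook's decision; Complementary bookings are often genuinely free, and a per-segment median leaves them untouched here):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for innhotel_df (illustrative values only)
df = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Complementary", "Online"],
    "avg_price_per_room":  [100.0,    120.0,    0.0,             0.0],
})

# Treat 0-euro rooms as missing, then impute each segment's median price
df["avg_price_per_room"] = df["avg_price_per_room"].replace(0, np.nan)
df["avg_price_per_room"] = (
    df.groupby("market_segment_type")["avg_price_per_room"]
      .transform(lambda s: s.fillna(s.median()))
)
print(df)  # the Online 0 becomes 110.0; the Complementary row stays NaN
```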

In [30]:
# Univariate analysis for no_of_special_requests
hist('no_of_special_requests',innhotel_df,False)
box('no_of_special_requests',innhotel_df)
display(innhotel_df['no_of_special_requests'].value_counts(normalize=True))
0    0.545196
1    0.313522
2    0.120303
3    0.018608
4    0.002150
5    0.000221
Name: no_of_special_requests, dtype: float64

About 54.5% of reservations made no special requests and 31.35% made 1. A few reservations have 4 or 5 requests.

What percentage of bookings are canceled?

In [31]:
# Univariate analysis for booking_status
count('booking_status',innhotel_df)

About 67.24% of reservations are not canceled, while 32.76% are canceled by the customer.

In [32]:
# Replacing not cancelled with 0 and canceled with 1 for booking status

mappings = {'Not_Canceled':0, 'Canceled':1}

innhotel_df['booking_status'] = innhotel_df['booking_status'].replace(mappings)

Bi Variate Analysis¶

In [33]:
# Heat plot for all numeric variable
plt.figure(figsize=(12, 7))
sns.heatmap(innhotel_df.corr(), annot=True, vmin=-1, vmax=1)
plt.show()

booking_status correlates most strongly with lead_time (0.44). It also has positive correlations with avg_price_per_room (0.14) and arrival_year (0.18). avg_price_per_room is positively correlated with no_of_adults (0.30) and no_of_children (0.34).
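The strongest predictors can also be read off programmatically rather than from the heatmap; a small sketch on toy data (the values below are illustrative, not the real frame):

```python
import pandas as pd

# Toy numeric frame standing in for innhotel_df (illustrative values only)
df = pd.DataFrame({
    "lead_time":          [5, 200, 30, 300, 10, 250],
    "avg_price_per_room": [60.0, 140.0, 70.0, 150.0, 65.0, 145.0],
    "booking_status":     [0, 1, 0, 1, 0, 1],
})

# Rank features by their correlation with the target
corr_with_target = (
    df.corr()["booking_status"]
      .drop("booking_status")
      .sort_values(ascending=False)
)
print(corr_with_target)
```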

In [34]:
# Creating a copy of the dataset
innhotel_df5 = innhotel_df.copy()
In [35]:
# Avg_price_per_room vs lead_time and booking_status
plt.figure(figsize=(20, 7))
sns.scatterplot(x='lead_time',y='avg_price_per_room',data=innhotel_df5,hue='booking_status')
Out[35]:
<Axes: xlabel='lead_time', ylabel='avg_price_per_room'>
In [36]:
# reservations made more than 150 days in advance
innhotel_df5[innhotel_df5['lead_time'] > 150]['booking_status'].value_counts(normalize=True)
Out[36]:
1    0.716248
0    0.283752
Name: booking_status, dtype: float64
In [37]:
# reservations made less than 150 days in advance
innhotel_df5[innhotel_df5['lead_time'] < 150]['booking_status'].value_counts(normalize=True)
Out[37]:
0    0.769859
1    0.230141
Name: booking_status, dtype: float64

Bookings made well in advance tend to have a lower average price per room, but cancellations rise with lead time: reservations made more than 150 days in advance are canceled about 72% of the time, compared with only about 23% for reservations made less than 150 days in advance.
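The two filters above generalize to several lead-time buckets in one pass with pd.cut; a sketch on toy data (the values and bin edges are illustrative):

```python
import pandas as pd

# Toy data standing in for innhotel_df5 (illustrative values only)
df = pd.DataFrame({
    "lead_time":      [5, 30, 90, 160, 200, 300, 10, 170],
    "booking_status": [0,  0,  1,   1,   1,   1,  0,   0],
})

# Bucket lead_time, then take the mean of the 0/1 target per bucket,
# which is exactly the cancellation rate in that bucket
buckets = pd.cut(df["lead_time"], bins=[0, 50, 150, 450], include_lowest=True)
cancel_rate = df.groupby(buckets, observed=True)["booking_status"].mean()
print(cancel_rate)  # rises from 0.0 in the shortest bucket to 0.75 in the longest
```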

In [38]:
# Avg_price_per_room vs arrival_month and year
plt.figure(figsize=(20, 7))
sns.catplot(x='arrival_month',y='avg_price_per_room',data=innhotel_df5,col='arrival_year',kind='point')
Out[38]:
<seaborn.axisgrid.FacetGrid at 0x792b3408efe0>
<Figure size 2000x700 with 0 Axes>

Rooms cost less early in the year; prices rise during the summer months (5 to 9) and fall again in the last two months. Rates in 2018 are noticeably higher than in 2017.
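The month-by-year comparison can also be tabulated directly with a pivot table; a sketch on toy data (illustrative prices, not the real ones):

```python
import pandas as pd

# Toy data standing in for innhotel_df5 (illustrative values only)
df = pd.DataFrame({
    "arrival_year":       [2017, 2017, 2018, 2018],
    "arrival_month":      [2, 7, 2, 7],
    "avg_price_per_room": [70.0, 110.0, 85.0, 130.0],
})

# Mean room price per arrival month, one column per year
price_by_month = df.pivot_table(
    index="arrival_month", columns="arrival_year",
    values="avg_price_per_room", aggfunc="mean",
)
print(price_by_month)
```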

In [39]:
# Avg_price_per_room vs booking status
plt.figure(figsize=(20, 7))
sns.catplot(x='booking_status',y='avg_price_per_room',data=innhotel_df5,kind='box')
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x792b2f66c7f0>
<Figure size 2000x700 with 0 Axes>

The avg_price_per_room is slightly higher for canceled reservations than for non-canceled ones, but the difference is small.

In [40]:
# Total guest and booking status
innhotel_df5['total_guest'] = innhotel_df5['no_of_adults'] + innhotel_df5['no_of_children']
plt.figure(figsize=(20, 7))
sns.catplot(x='total_guest',data=innhotel_df5,hue='booking_status',kind='count')
innhotel_df5.groupby('total_guest')['booking_status'].value_counts(normalize=True)
Out[40]:
total_guest  booking_status
1            0                 0.760461
             1                 0.239539
2            0                 0.654164
             1                 0.345836
3            0                 0.638535
             1                 0.361465
4            0                 0.563596
             1                 0.436404
5            0                 0.666667
             1                 0.333333
10           0                 1.000000
11           1                 1.000000
12           0                 1.000000
Name: booking_status, dtype: float64
<Figure size 2000x700 with 0 Axes>

The cancellation rate is about 24% for bookings with 1 guest, rises to roughly 35-36% for 2 or 3 guests, and reaches about 44% for 4 guests.

In [41]:
# Market_segment-type vs booking_Status
sns.countplot(x='market_segment_type',data=innhotel_df5,hue='booking_status')
innhotel_df5.groupby('market_segment_type')['booking_status'].value_counts(normalize=True)
Out[41]:
market_segment_type  booking_status
Aviation             0                 0.704000
                     1                 0.296000
Complementary        0                 1.000000
Corporate            0                 0.890927
                     1                 0.109073
Offline              0                 0.700513
                     1                 0.299487
Online               0                 0.634919
                     1                 0.365081
Name: booking_status, dtype: float64

The cancellation rate is highest for the Online market segment and very low for the Corporate and Complementary segments.

Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

In [42]:
# Market_segment-type vs avg_price_per_room
plt.figure(figsize=(20, 7))
sns.pointplot(x='market_segment_type',y='avg_price_per_room',data=innhotel_df5,hue='booking_status')
Out[42]:
<Axes: xlabel='market_segment_type', ylabel='avg_price_per_room'>
In [43]:
innhotel_df5.groupby('market_segment_type')['avg_price_per_room'].mean()
Out[43]:
market_segment_type
Aviation         100.704000
Complementary      3.141765
Corporate         82.911740
Offline           91.632679
Online           112.256855
Name: avg_price_per_room, dtype: float64

The avg_price_per_room is highest for the Online market segment at about 112.26 euros, followed by Aviation at 100.70 euros. Offline averages 91.63 euros and Corporate 82.91 euros (Complementary bookings are near-free at 3.14 euros on average). Canceled bookings also tend to have a higher price than non-canceled ones across all market segments.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

In [44]:
display(innhotel_df5.groupby('repeated_guest')['booking_status'].value_counts())
display(innhotel_df5.groupby('repeated_guest')['booking_status'].value_counts(normalize=True))
ax = sns.countplot(x='repeated_guest',data=innhotel_df5[innhotel_df5['repeated_guest'] == 1],hue='booking_status')
barplot_values_percent(ax)
repeated_guest  booking_status
0               0                 23476
                1                 11869
1               0                   914
                1                    16
Name: booking_status, dtype: int64
repeated_guest  booking_status
0               0                 0.664196
                1                 0.335804
1               0                 0.982796
                1                 0.017204
Name: booking_status, dtype: float64

As we can see, a repeated guest has a 98.28% chance of not canceling the reservation and only a 1.72% chance of canceling, compared with a 33.58% cancellation rate for first-time guests.

Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

In [45]:
sns.countplot(x='no_of_special_requests',data=innhotel_df5,hue='booking_status')
display(innhotel_df5.groupby('no_of_special_requests')['booking_status'].value_counts())
display(innhotel_df5.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True))
no_of_special_requests  booking_status
0                       0                 11232
                        1                  8545
1                       0                  8670
                        1                  2703
2                       0                  3727
                        1                   637
3                       0                   675
4                       0                    78
5                       0                     8
Name: booking_status, dtype: int64
no_of_special_requests  booking_status
0                       0                 0.567932
                        1                 0.432068
1                       0                 0.762332
                        1                 0.237668
2                       0                 0.854033
                        1                 0.145967
3                       0                 1.000000
4                       0                 1.000000
5                       0                 1.000000
Name: booking_status, dtype: float64

As we can see, when customers make special requests the chance of cancellation goes down. Customers with no special requests cancel about 43.2% of the time, while with just one request the cancellation rate drops to about 23.8%. With 3, 4, or 5 requests, the observed cancellation rate is 0.

Data Preprocessing¶

Missing value treatment¶

In [46]:
innhotel_df.isna().sum()
Out[46]:
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

There are no missing values

Feature Engineering¶

In [47]:
# Checking that every reservation has at least one adult or child (i.e., both counts are not zero)
no_peoplemiss = innhotel_df.query("no_of_adults < 0.5 and no_of_children < 0.5")
no_peoplemiss.shape
Out[47]:
(0, 18)

No booking has both no_of_adults and no_of_children equal to zero.

In [48]:
# Some rooms have an average price of 0, so we need to treat those
innhotel_df[innhotel_df['avg_price_per_room'] <= 0].shape
Out[48]:
(545, 18)

There are 545 rows where avg_price_per_room is 0, so we replace those values with the median price for the dataset.

In [49]:
# Replacing 0 avg_price_per_room  with median avg_price_per_room
innhotel_df.loc[innhotel_df['avg_price_per_room'] <= 0, 'avg_price_per_room'] = innhotel_df['avg_price_per_room'].median()
In [50]:
# Checking to make sure no data has avg_price_per_room = 0
innhotel_df[innhotel_df['avg_price_per_room'] <= 0].shape
Out[50]:
(0, 18)
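A caveat worth noting: the median used above is computed over the full column, so the 545 zero-price rows themselves pull it down slightly. A minimal, self-contained sketch (with made-up prices) of the difference between including and excluding the zeros:

```python
import statistics

# Made-up prices; the two zeros stand in for the 545 zero-price rows.
prices = [0, 0, 80, 95, 100, 110, 120]

median_all = statistics.median(prices)                            # zeros included
median_nonzero = statistics.median([p for p in prices if p > 0])  # zeros excluded

print(median_all)      # 95
print(median_nonzero)  # 100
```

With only 545 zeros out of 36,275 rows the effect on the real dataset is small, but computing the median over the non-zero rows first is arguably cleaner.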

Checking for Outliers¶

In [51]:
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list
numeric_columns = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'required_car_parking_space',
       'lead_time', 'arrival_year', 'arrival_month',
       'arrival_date', 'repeated_guest',
       'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
       'avg_price_per_room', 'no_of_special_requests']
plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(innhotel_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [52]:
# to find the 25th percentile and 75th percentile for the numerical columns.
Q1 = innhotel_df[numeric_columns].quantile(0.25)
Q3 = innhotel_df[numeric_columns].quantile(0.75)

IQR = Q3 - Q1                   #Inter Quantile Range (75th percentile - 25th percentile)

lower_whisker = Q1 - 1.5*IQR    #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
In [53]:
# Percentage of outliers in each column
((innhotel_df[numeric_columns] < lower_whisker) | (innhotel_df[numeric_columns] > upper_whisker)).sum()/innhotel_df.shape[0]*100
Out[53]:
no_of_adults                            28.027567
no_of_children                           7.437629
no_of_weekend_nights                     0.057891
no_of_week_nights                        0.893177
required_car_parking_space               3.098553
lead_time                                3.669194
arrival_year                            17.957271
arrival_month                            0.000000
arrival_date                             0.000000
repeated_guest                           2.563749
no_of_previous_cancellations             0.931771
no_of_previous_bookings_not_canceled     2.238456
avg_price_per_room                       3.244659
no_of_special_requests                   2.097864
dtype: float64

Most of the outliers in the data look plausible, except for no_of_children greater than 3 and avg_price_per_room greater than $300. We will check how many data points have an average rate above $300 and more than 3 children.

In [54]:
# Checking to make sure how many data point has avg_price_per_room > 300
innhotel_df[innhotel_df['avg_price_per_room'] > 300].shape
Out[54]:
(9, 18)
In [55]:
# Checking to make sure how many data point has no of children > 3
innhotel_df[innhotel_df['no_of_children'] > 3].shape
Out[55]:
(3, 18)

As we can see, only 9 data points have avg_price_per_room > 300 and 3 data points have no_of_children > 3. Both are plausible values, so we do not need to treat these outliers.

Preparing data for modelling¶

In [56]:
# Creating dummy variables for all object-dtype columns in the dataset
innhotel_df3 = pd.get_dummies(innhotel_df, columns=['type_of_meal_plan', 'room_type_reserved', 'market_segment_type'], drop_first=True)
In [57]:
innhotel_df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36275 non-null  int64  
 1   no_of_children                        36275 non-null  int64  
 2   no_of_weekend_nights                  36275 non-null  int64  
 3   no_of_week_nights                     36275 non-null  int64  
 4   required_car_parking_space            36275 non-null  int64  
 5   lead_time                             36275 non-null  int64  
 6   arrival_year                          36275 non-null  int64  
 7   arrival_month                         36275 non-null  int64  
 8   arrival_date                          36275 non-null  int64  
 9   repeated_guest                        36275 non-null  int64  
 10  no_of_previous_cancellations          36275 non-null  int64  
 11  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 12  avg_price_per_room                    36275 non-null  float64
 13  no_of_special_requests                36275 non-null  int64  
 14  booking_status                        36275 non-null  int64  
 15  type_of_meal_plan_Meal Plan 2         36275 non-null  uint8  
 16  type_of_meal_plan_Meal Plan 3         36275 non-null  uint8  
 17  type_of_meal_plan_Not Selected        36275 non-null  uint8  
 18  room_type_reserved_Room_Type 2        36275 non-null  uint8  
 19  room_type_reserved_Room_Type 3        36275 non-null  uint8  
 20  room_type_reserved_Room_Type 4        36275 non-null  uint8  
 21  room_type_reserved_Room_Type 5        36275 non-null  uint8  
 22  room_type_reserved_Room_Type 6        36275 non-null  uint8  
 23  room_type_reserved_Room_Type 7        36275 non-null  uint8  
 24  market_segment_type_Complementary     36275 non-null  uint8  
 25  market_segment_type_Corporate         36275 non-null  uint8  
 26  market_segment_type_Offline           36275 non-null  uint8  
 27  market_segment_type_Online            36275 non-null  uint8  
dtypes: float64(1), int64(14), uint8(13)
memory usage: 4.6 MB

Building a Logistic Regression model¶


Model evaluation criterion

The model can make wrong predictions in two ways:

1) It predicts that a customer will cancel the booking, but in reality they do not cancel. In this case, the hotel will not have enough staff on hand to provide good service to the guest.

2) It predicts that a customer will not cancel the booking, but in reality they do cancel. In this case, the hotel loses revenue and ends up with a lower profit margin if it resells the room at the last minute at a discounted price.

Both cases are important to us, so:

We want to maximize the F1 score, since both false negatives and false positives are important in our case.
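The trade-off above can be made concrete with a small sketch (the counts below are hypothetical, not from this dataset): F1 is the harmonic mean of precision and recall, so it only scores well when both error types are kept low.

```python
# Hypothetical confusion-matrix counts (not from this dataset), illustrating
# how F1 balances false positives (FP) against false negatives (FN).
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)  # hurt by FP: predicted cancel, guest showed up
    recall = tp / (tp + fn)     # hurt by FN: predicted show, guest canceled
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

precision, recall, f1 = prf1(tp=700, fp=250, fn=350)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.737 0.667 0.7
```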

In [58]:
# independent variables
X = innhotel_df3.drop(["booking_status"], axis=1)
# dependent variable
y = innhotel_df3[["booking_status"]]
# this adds the constant term to the dataset
X = sm.add_constant(X)
In [59]:
# Splitting data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1,stratify=y
)
In [60]:
#from statsmodels.genmod.families.links import Logit
# Fitting the model
logit = sm.Logit(y_train, X_train)
lg = logit.fit()
Optimization terminated successfully.
         Current function value: 0.423894
         Iterations 26
In [61]:
# let's print the logistic regression summary
print(lg.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25364
Method:                           MLE   Df Model:                           27
Date:                Thu, 02 Nov 2023   Pseudo R-squ.:                  0.3298
Time:                        22:14:01   Log-Likelihood:                -10764.
converged:                       True   LL-Null:                       -16060.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -952.0595    121.184     -7.856      0.000   -1189.575    -714.544
no_of_adults                             0.0538      0.038      1.432      0.152      -0.020       0.127
no_of_children                           0.0973      0.060      1.609      0.108      -0.021       0.216
no_of_weekend_nights                     0.1483      0.020      7.484      0.000       0.109       0.187
no_of_week_nights                        0.0375      0.012      3.062      0.002       0.014       0.062
required_car_parking_space              -1.5989      0.137    -11.663      0.000      -1.868      -1.330
lead_time                                0.0157      0.000     58.817      0.000       0.015       0.016
arrival_year                             0.4705      0.060      7.834      0.000       0.353       0.588
arrival_month                           -0.0465      0.006     -7.178      0.000      -0.059      -0.034
arrival_date                             0.0030      0.002      1.545      0.122      -0.001       0.007
repeated_guest                          -1.9350      0.758     -2.553      0.011      -3.420      -0.450
no_of_previous_cancellations             0.3472      0.102      3.413      0.001       0.148       0.547
no_of_previous_bookings_not_canceled    -1.3735      0.903     -1.522      0.128      -3.143       0.396
avg_price_per_room                       0.0179      0.001     23.660      0.000       0.016       0.019
no_of_special_requests                  -1.4826      0.030    -48.831      0.000      -1.542      -1.423
type_of_meal_plan_Meal Plan 2            0.1633      0.067      2.443      0.015       0.032       0.294
type_of_meal_plan_Meal Plan 3           32.2368   6.52e+06   4.95e-06      1.000   -1.28e+07    1.28e+07
type_of_meal_plan_Not Selected           0.2014      0.053      3.781      0.000       0.097       0.306
room_type_reserved_Room_Type 2          -0.4188      0.133     -3.154      0.002      -0.679      -0.159
room_type_reserved_Room_Type 3           1.2013      1.891      0.635      0.525      -2.506       4.908
room_type_reserved_Room_Type 4          -0.2534      0.053     -4.752      0.000      -0.358      -0.149
room_type_reserved_Room_Type 5          -0.6678      0.214     -3.114      0.002      -1.088      -0.247
room_type_reserved_Room_Type 6          -0.8170      0.153     -5.350      0.000      -1.116      -0.518
room_type_reserved_Room_Type 7          -1.3282      0.298     -4.462      0.000      -1.912      -0.745
market_segment_type_Complementary      -90.2612   2.74e+13  -3.29e-12      1.000   -5.38e+13    5.38e+13
market_segment_type_Corporate           -0.8404      0.276     -3.046      0.002      -1.381      -0.300
market_segment_type_Offline             -1.7530      0.264     -6.642      0.000      -2.270      -1.236
market_segment_type_Online              -0.0095      0.261     -0.037      0.971      -0.521       0.502
========================================================================================================

Model performance evaluation¶

In [62]:
# Function to evaluate model performance
def score(model, train, act, desc, n):
    """
      Evaluate and report model performance.

      Inputs:
      model: fitted model used to generate predictions
      train: feature matrix (training set, test set, or X)
      act: actual target values from the dataset (y)
      desc: label used when printing ('Training' or 'Test')
      n: probability threshold for classifying a booking as canceled

      Outputs:
      DataFrame with Recall, Precision, Accuracy, and F1 score

    """
    # predicted probability above the threshold -> predicted class 1
    pred = (model.predict(train) > n).astype(int)
    pc_test = precision_score(act, pred)
    print("The precision score is {pc:.3f}".format(pc = pc_test))
    rc_test = recall_score(act, pred)
    print("The recall score is {rc:.3f}".format(rc = rc_test))
    ac_test = accuracy_score(act, pred)
    print("The accuracy score is {ac:.3f}".format(ac = ac_test))
    f1_test = f1_score(act, pred)
    print("The F1 score is {f1:.3f}".format(f1 = f1_test))
    # plotting the confusion matrix of the model
    cm = confusion_matrix(act, pred)
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, fmt="g")
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.show()
    # collecting the results in a DataFrame
    df_pred = pd.DataFrame()
    df_pred["Recall"] = [rc_test]
    df_pred["Precision"] = [pc_test]
    df_pred["Accuracy"] = [ac_test]
    df_pred["F1_score"] = [f1_test]
    print( "Result for the",desc,"model are:",'\n')
    return df_pred
In [63]:
# Evaluating training model performance
log_reg_model_train_perf = score(lg,X_train,y_train,'Training',0.5)
log_reg_model_train_perf
The precision score is 0.739
The recall score is 0.630
The accuracy score is 0.806
The F1 score is 0.680
Result for the Training model are: 

Out[63]:
Recall Precision Accuracy F1_score
0 0.630364 0.739112 0.806002 0.68042
In [64]:
# Evaluating testing model performance
log_reg_model_test_perf = score(lg,X_test,y_test,'Test',0.5)
log_reg_model_test_perf
The precision score is 0.735
The recall score is 0.624
The accuracy score is 0.803
The F1 score is 0.675
Result for the Test model are: 

Out[64]:
Recall Precision Accuracy F1_score
0 0.623948 0.734808 0.802995 0.674856

Checking for Multicollinearity¶

In [65]:
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series.sort_values()))
VIF values: 

room_type_reserved_Room_Type 3          1.003920e+00
arrival_date                            1.007679e+00
type_of_meal_plan_Meal Plan 3           1.008016e+00
room_type_reserved_Room_Type 5          1.032550e+00
required_car_parking_space              1.034472e+00
no_of_weekend_nights                    1.070613e+00
room_type_reserved_Room_Type 2          1.095021e+00
no_of_week_nights                       1.096991e+00
room_type_reserved_Room_Type 7          1.097799e+00
no_of_special_requests                  1.246800e+00
type_of_meal_plan_Meal Plan 2           1.283243e+00
arrival_month                           1.283548e+00
type_of_meal_plan_Not Selected          1.286599e+00
no_of_previous_cancellations            1.322032e+00
no_of_adults                            1.339355e+00
room_type_reserved_Room_Type 4          1.361858e+00
lead_time                               1.408643e+00
arrival_year                            1.429625e+00
no_of_previous_bookings_not_canceled    1.570753e+00
repeated_guest                          1.749581e+00
avg_price_per_room                      1.953082e+00
no_of_children                          2.005270e+00
room_type_reserved_Room_Type 6          2.008492e+00
market_segment_type_Complementary       4.175391e+00
market_segment_type_Corporate           1.663647e+01
market_segment_type_Offline             6.251368e+01
market_segment_type_Online              6.949159e+01
const                                   3.949439e+07
dtype: float64

We can see that the dummy variables for market_segment_type show multicollinearity, but we can ignore that for the time being and check the p-values instead.

  • We now remove columns with p-values greater than 0.05. Since we cannot drop all such columns at once, we follow these steps:
  • Build a model, check the p-values of the variables, and drop the column with the highest p-value.
  • Create a new model without the dropped feature, check the p-values again, and drop the column with the highest p-value.
  • Repeat the above two steps until there are no columns with p-value > 0.05.
In [66]:
 # initial list of columns
predictors = X_train.copy()
cols = predictors.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = predictors[cols]

    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
Optimization terminated successfully.
         Current function value: 0.423894
         Iterations 26
Optimization terminated successfully.
         Current function value: 0.424666
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424668
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424677
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424704
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424742
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424782
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.424949
         Iterations 11
Optimization terminated successfully.
         Current function value: 0.424998
         Iterations 11
['const', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Offline', 'market_segment_type_Online']
In [67]:
# Taking available features after removing all columns with high p values
X_train3 = X_train[selected_features]
X_test3 = X_test[selected_features]
In [68]:
# Fitting the model
logit3 = sm.Logit(y_train, X_train3)
lg3 = logit3.fit()
Optimization terminated successfully.
         Current function value: 0.424998
         Iterations 11
In [69]:
# let's print the logistic regression summary
print(lg3.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25372
Method:                           MLE   Df Model:                           19
Date:                Thu, 02 Nov 2023   Pseudo R-squ.:                  0.3280
Time:                        22:14:08   Log-Likelihood:                -10792.
converged:                       True   LL-Null:                       -16060.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                           -963.8178    120.702     -7.985      0.000   -1200.389    -727.247
no_of_weekend_nights               0.1538      0.020      7.781      0.000       0.115       0.193
no_of_week_nights                  0.0401      0.012      3.278      0.001       0.016       0.064
required_car_parking_space        -1.5913      0.137    -11.617      0.000      -1.860      -1.323
lead_time                          0.0158      0.000     59.383      0.000       0.015       0.016
arrival_year                       0.4759      0.060      7.956      0.000       0.359       0.593
arrival_month                     -0.0474      0.006     -7.333      0.000      -0.060      -0.035
repeated_guest                    -3.0650      0.594     -5.159      0.000      -4.229      -1.901
no_of_previous_cancellations       0.2871      0.078      3.702      0.000       0.135       0.439
avg_price_per_room                 0.0182      0.001     24.554      0.000       0.017       0.020
no_of_special_requests            -1.4788      0.030    -49.144      0.000      -1.538      -1.420
type_of_meal_plan_Meal Plan 2      0.1653      0.067      2.476      0.013       0.034       0.296
type_of_meal_plan_Not Selected     0.2042      0.053      3.858      0.000       0.100       0.308
room_type_reserved_Room_Type 2    -0.3795      0.129     -2.953      0.003      -0.631      -0.128
room_type_reserved_Room_Type 4    -0.2361      0.052     -4.578      0.000      -0.337      -0.135
room_type_reserved_Room_Type 5    -0.6578      0.213     -3.083      0.002      -1.076      -0.240
room_type_reserved_Room_Type 6    -0.6696      0.120     -5.558      0.000      -0.906      -0.433
room_type_reserved_Room_Type 7    -1.2378      0.291     -4.251      0.000      -1.808      -0.667
market_segment_type_Offline       -0.8637      0.100     -8.607      0.000      -1.060      -0.667
market_segment_type_Online         0.8874      0.095      9.295      0.000       0.700       1.075
==================================================================================================
In [70]:
# Evaluating training model performance
log_reg_model_train_perf = score(lg3,X_train3,y_train,'Training',0.5)
log_reg_model_train_perf
The precision score is 0.738
The recall score is 0.630
The accuracy score is 0.806
The F1 score is 0.680
Result for the Training model are: 

Out[70]:
Recall Precision Accuracy F1_score
0 0.629763 0.738303 0.805569 0.679728
In [71]:
# Evaluating testing model performance
log_reg_model_test_perf = score(lg3,X_test3,y_test,'Test',0.5)
log_reg_model_test_perf
The precision score is 0.733
The recall score is 0.623
The accuracy score is 0.802
The F1 score is 0.674
Result for the Test model are: 

Out[71]:
Recall Precision Accuracy F1_score
0 0.623388 0.732938 0.802169 0.673738

Coefficient interpretations¶

Converting coefficients to odds

The coefficients of the logistic regression model are in terms of log(odd), to find the odds we have to take the exponential of the coefficients.

Therefore, odds = exp(b). The percentage change in odds is given as: change in odds (%) = (exp(b) - 1) * 100.

In [72]:
# converting coefficients to odds
odds = np.exp(lg3.params)

# finding the percentage change
perc_change_odds = (np.exp(lg3.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
Out[72]:
const no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month repeated_guest no_of_previous_cancellations avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Offline market_segment_type_Online
Odds 0.0 1.166227 1.040926 0.203656 1.015886 1.609466 0.953728 0.046652 1.332599 1.018405 0.227918 1.179792 1.226595 0.684213 0.789676 0.517972 0.511895 0.290028 0.421580 2.428786
Change_odd% -100.0 16.622697 4.092632 -79.634365 1.588584 60.946585 -4.627241 -95.334836 33.259873 1.840518 -77.208169 17.979158 22.659467 -31.578671 -21.032363 -48.202750 -48.810502 -70.997222 -57.841969 142.878621

Coefficient Interpretation:

The coefficients of no_of_weekend_nights, no_of_week_nights, lead_time, arrival_year, no_of_previous_cancellations, avg_price_per_room, some meal-plan levels, and the Online market segment are positive: an increase in these leads to an increase in the chances of a booking being canceled.

The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests, some room-type levels, and the Offline market segment are negative: an increase in these leads to a decrease in the chances of a booking being canceled.
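As a worked example, using (rounded) coefficients from the lg3 summary above, the odds conversion works out as follows:

```python
import math

# Rounded coefficients taken from the lg3 summary above.
odds_lead_time = math.exp(0.0158)         # ~1.016 -> about +1.6% odds per extra day
odds_special_requests = math.exp(-1.4788) # ~0.228 -> about -77.2% odds per request

print(f"lead_time: x{odds_lead_time:.4f} "
      f"({(odds_lead_time - 1) * 100:+.1f}% per unit)")
print(f"no_of_special_requests: x{odds_special_requests:.4f} "
      f"({(odds_special_requests - 1) * 100:+.1f}% per unit)")
```

So each extra day of lead time raises the cancellation odds by about 1.6%, while each extra special request cuts them by about 77%, matching the Odds / Change_odd% table.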

In [73]:
# Receiver operating characteristic curve
logit_roc_auc_train = roc_auc_score(y_train, lg3.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Optimal threshold using AUC-ROC curve¶

In [74]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3273550051135727
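Note that argmax(tpr - fpr) is Youden's J statistic, J = TPR − FPR: the threshold that maximizes the vertical distance between the ROC curve and the diagonal. A minimal sketch with illustrative (made-up) ROC points; the real arrays come from roc_curve above:

```python
# Made-up ROC points for illustration only.
thresholds = [0.9, 0.7, 0.5, 0.33, 0.2, 0.1]
fpr = [0.02, 0.08, 0.18, 0.25, 0.45, 0.70]
tpr = [0.30, 0.55, 0.68, 0.77, 0.85, 0.92]

j = [t - f for t, f in zip(tpr, fpr)]         # Youden's J at each threshold
best = max(range(len(j)), key=j.__getitem__)  # index of max TPR - FPR
print(thresholds[best], round(j[best], 2))    # 0.33 0.52
```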
In [75]:
# Evaluating training model performance with optimal threshold from AUC-ROC curve
log_reg_model_train_perf_threshold_auc_roc = score(lg3,X_train3,y_train,'Training',0.3274)
log_reg_model_train_perf_threshold_auc_roc
The precision score is 0.640
The recall score is 0.768
The accuracy score is 0.783
The F1 score is 0.699
Result for the Training model are: 

Out[75]:
Recall Precision Accuracy F1_score
0 0.768362 0.640481 0.782806 0.698617
In [76]:
# Evaluating testing model performance
log_reg_model_test_perf_threshold_auc_roc = score(lg3,X_test3,y_test,'Testing',0.3274)
log_reg_model_test_perf_threshold_auc_roc
The precision score is 0.632
The recall score is 0.769
The accuracy score is 0.777
The F1 score is 0.694
Result for the Testing model are: 

Out[76]:
Recall Precision Accuracy F1_score
0 0.769209 0.631737 0.777451 0.693728

Let's use the Precision-Recall curve and see if we can find a better threshold¶

In [77]:
y_scores = lg3.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()

We can use a threshold of 0.42 as per the Precision-Recall curve. Let's use that and check the model performance.
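Rather than reading the crossover off the plot by eye, the threshold where precision and recall are closest can be located programmatically. A sketch with illustrative stand-in arrays (the real `prec`, `rec`, `tre` come from `precision_recall_curve` above; note that it returns precision/recall arrays one element longer than the thresholds, hence the `[:-1]` slices in the plot helper):

```python
# Illustrative stand-ins for prec, rec, tre from precision_recall_curve above.
prec = [0.60, 0.65, 0.70, 0.74, 0.80]
rec  = [0.90, 0.80, 0.70, 0.60, 0.45]
tre  = [0.20, 0.30, 0.42, 0.55, 0.70]

gaps = [abs(p - r) for p, r in zip(prec, rec)]   # |precision - recall|
i = min(range(len(gaps)), key=gaps.__getitem__)  # closest crossing point
print(tre[i])  # 0.42
```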

In [78]:
# Evaluating training model performance
log_reg_model_train_perf_threshold_prcurve = score(lg3,X_train3,y_train,'Training',0.42)
log_reg_model_train_perf_threshold_prcurve
The precision score is 0.699
The recall score is 0.698
The accuracy score is 0.802
The F1 score is 0.698
Result for the Training model are: 

Out[78]:
Recall Precision Accuracy F1_score
0 0.69756 0.698904 0.802457 0.698231
In [79]:
# Evaluating testing model performance
log_reg_model_test_perf_threshold_prcurve = score(lg3,X_test3,y_test,'Testing',0.42)
log_reg_model_test_perf_threshold_prcurve
The precision score is 0.690
The recall score is 0.694
The accuracy score is 0.798
The F1 score is 0.692
Result for the Testing model are: 

Out[79]:
Recall Precision Accuracy F1_score
0 0.694335 0.690078 0.797666 0.6922

Final Model Summary¶

In [80]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_prcurve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.327 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[80]:
Logistic Regression-default Threshold Logistic Regression-0.327 Threshold Logistic Regression-0.42 Threshold
Recall 0.629763 0.768362 0.697560
Precision 0.738303 0.640481 0.698904
Accuracy 0.805569 0.782806 0.802457
F1_score 0.679728 0.698617 0.698231
In [81]:
# testing performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_prcurve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.327 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Testing performance comparison:")
models_train_comp_df
Testing performance comparison:
Out[81]:
Logistic Regression-default Threshold Logistic Regression-0.327 Threshold Logistic Regression-0.42 Threshold
Recall 0.623388 0.769209 0.694335
Precision 0.732938 0.631737 0.690078
Accuracy 0.802169 0.777451 0.797666
F1_score 0.673738 0.693728 0.692200

Building a Decision Tree model¶

For a decision tree, we don't have to worry about multicollinearity or outliers.

In [82]:
# Creating dummy variables for all object-dtype columns without dropping the first level
innhotel_df4 = pd.get_dummies(innhotel_df, columns=['type_of_meal_plan', 'room_type_reserved', 'market_segment_type'])
In [83]:
# independent variables
X = innhotel_df4.drop(["booking_status"], axis=1)
# dependent variable
y = innhotel_df4[["booking_status"]]
In [84]:
# Splitting data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
In [85]:
# Creating a model and fitting to data
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
Out[85]:
DecisionTreeClassifier(random_state=1)
In [86]:
# Scoring on model
print("Accuracy on training set : ",dTree.score(X_train, y_train))
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set :  0.994210775047259
Accuracy on test set :  0.8704401359919139
In [87]:
# Function to evaluate model performance
def score(model,train,act,desc):
    """
      Evaluates model performance and plots the confusion matrix.

      Inputs:
      model: fitted model used to generate predictions
      train: feature matrix to predict on (training or test set)
      act: actual target values from the dataset (y)
      desc: label ('Training' or 'Testing') used when printing results

      Outputs:
      DataFrame with Recall, Precision, Accuracy and F1 score

    """
    pred = model.predict(train)
    pc_test = precision_score(act, pred)
    print("The precision score is {pc:.3f}".format(pc = pc_test))
    rc_test = recall_score(act, pred)
    print("The recall score is {rc:.3f}".format(rc = rc_test))
    ac_test = accuracy_score(act, pred)
    print("The accuracy score is {ac:.3f}".format(ac = ac_test))
    f1_test = f1_score(act, pred)
    print("The F1 score is {f1:.3f}".format(f1 = f1_test))
# plotting the confusion matrix of the classification model
    cm = confusion_matrix(act, pred)
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, fmt="g")
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.show()
# Printing results
    df_pred = pd.DataFrame()
    df_pred["Recall"] = [rc_test]
    df_pred["Precision"] = [pc_test]
    df_pred["Accuracy"] = [ac_test]
    df_pred["F1_score"] = [f1_test]
    print( "Results for the",desc,"model are:",'\n')
    return df_pred
In [88]:
# Evaluating training model performance
decisiontree_train_perf = score(dTree,X_train,y_train,'Training')
decisiontree_train_perf
The precision score is 0.996
The recall score is 0.987
The accuracy score is 0.994
The F1 score is 0.991
Results for the Training model are: 

Out[88]:
Recall Precision Accuracy F1_score
0 0.986608 0.995776 0.994211 0.991171
In [89]:
# Evaluating testing model performance
decisiontree_test_perf = score(dTree,X_test,y_test,'Testing')
decisiontree_test_perf
The precision score is 0.797
The recall score is 0.804
The accuracy score is 0.870
The F1 score is 0.801
Results for the Testing model are: 

Out[89]:
Recall Precision Accuracy F1_score
0 0.804373 0.79713 0.87044 0.800735
In [90]:
feature_names = list(X.columns)
print(feature_names)
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 1', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 1', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Aviation', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline', 'market_segment_type_Online']
In [91]:
importances = dTree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In [92]:
# importance of features in the tree building
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.347194
avg_price_per_room                    0.179874
market_segment_type_Online            0.094819
arrival_date                          0.081914
no_of_special_requests                0.068043
arrival_month                         0.064088
no_of_week_nights                     0.045813
no_of_weekend_nights                  0.037850
no_of_adults                          0.025906
arrival_year                          0.012276
required_car_parking_space            0.007353
type_of_meal_plan_Meal Plan 1         0.006648
room_type_reserved_Room_Type 4        0.005596
room_type_reserved_Room_Type 1        0.004630
type_of_meal_plan_Not Selected        0.003729
no_of_children                        0.003581
type_of_meal_plan_Meal Plan 2         0.002102
room_type_reserved_Room_Type 2        0.002022
room_type_reserved_Room_Type 5        0.001631
market_segment_type_Offline           0.001287
market_segment_type_Aviation          0.000759
room_type_reserved_Room_Type 7        0.000682
room_type_reserved_Room_Type 6        0.000669
market_segment_type_Corporate         0.000515
repeated_guest                        0.000483
no_of_previous_bookings_not_canceled  0.000371
no_of_previous_cancellations          0.000091
market_segment_type_Complementary     0.000075
type_of_meal_plan_Meal Plan 3         0.000000
room_type_reserved_Room_Type 3        0.000000

As we can see, 'lead_time', 'avg_price_per_room', 'market_segment_type_Online' and 'arrival_date' are some of the most important features, together contributing almost 70% to the decision tree's classification predictions.

Do we need to prune the tree?¶

The tree above is very complex and is overfitting the training data, so we need to prune the tree.
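The gap between the ~99.4% training accuracy and ~87% test accuracy above is the classic overfitting signature. A minimal sketch on synthetic data (not the hotel dataset), showing how limiting max_depth, one form of pre-pruning, narrows that gap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hotel data, for illustration only
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)        # unpruned
pruned = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

# The unpruned tree memorizes the training set; the depth-limited tree
# trades a little training accuracy for a much smaller train/test gap.
for name, m in [("unpruned", full), ("max_depth=4", pruned)]:
    print(name, round(m.score(X_tr, y_tr), 3), round(m.score(X_te, y_te), 3))
```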

Using GridSearch for Hyperparameter tuning of our tree model¶

Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the loss of the model, so we usually resort to experimentation, i.e. grid search. Grid search is a tuning technique that attempts to compute the optimum values of the hyperparameters.

It is an exhaustive search performed over the specified parameter values of a model.

The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.

In [93]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[93]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
In [94]:
# Evaluating training model performance
decisiontree_train_perf_prune = score(estimator,X_train,y_train,'Training')
decisiontree_train_perf_prune
The precision score is 0.724
The recall score is 0.786
The accuracy score is 0.831
The F1 score is 0.754
Results for the Training model are: 

Out[94]:
Recall Precision Accuracy F1_score
0 0.785962 0.724058 0.830852 0.753741
In [95]:
# Evaluating testing model performance
decisiontree_test_perf_prune = score(estimator,X_test,y_test,'Testing')
decisiontree_test_perf_prune
The precision score is 0.728
The recall score is 0.783
The accuracy score is 0.835
The F1 score is 0.754
Results for the Testing model are: 

Out[95]:
Recall Precision Accuracy F1_score
0 0.783078 0.727513 0.83488 0.754273
In [96]:
plt.figure(figsize=(15,10))

tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [97]:
# importance of features in the tree building (computed as the (normalized) total reduction of the criterion brought by that feature)
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

#Here we see that the importance is now concentrated in fewer features, so the top scores increase
                                           Imp
lead_time                             0.475993
market_segment_type_Online            0.184770
no_of_special_requests                0.169335
avg_price_per_room                    0.075421
no_of_adults                          0.026944
no_of_weekend_nights                  0.020608
arrival_month                         0.014138
required_car_parking_space            0.014114
market_segment_type_Offline           0.009958
no_of_week_nights                     0.007006
type_of_meal_plan_Not Selected        0.000950
arrival_date                          0.000761
no_of_previous_bookings_not_canceled  0.000000
room_type_reserved_Room_Type 4        0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Aviation          0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 3        0.000000
no_of_previous_cancellations          0.000000
room_type_reserved_Room_Type 2        0.000000
room_type_reserved_Room_Type 1        0.000000
arrival_year                          0.000000
type_of_meal_plan_Meal Plan 3         0.000000
no_of_children                        0.000000
type_of_meal_plan_Meal Plan 1         0.000000
repeated_guest                        0.000000
type_of_meal_plan_Meal Plan 2         0.000000
In [98]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Cost Complexity Pruning¶

In [99]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [100]:
pd.DataFrame(path)
Out[100]:
ccp_alphas impurities
0 0.000000e+00 0.007572
1 4.327745e-07 0.007573
2 4.688391e-07 0.007573
3 5.329960e-07 0.007574
4 6.133547e-07 0.007575
... ... ...
1342 6.665684e-03 0.286897
1343 1.304480e-02 0.299942
1344 1.725993e-02 0.317202
1345 2.399048e-02 0.365183
1346 7.657789e-02 0.441761

1347 rows × 2 columns

In [101]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [102]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.07657789477371374
In [103]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Accuracy vs alpha for training and testing sets: when ccp_alpha is set to zero, keeping the other parameters of DecisionTreeClassifier at their defaults, the tree overfits, leading to about 99.4% training accuracy and 87% testing accuracy. As alpha increases, more of the tree is pruned, creating a decision tree that generalizes better.

In [104]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [105]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [106]:
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.00012004617212610687, random_state=1)
Training accuracy of best model:  0.9018982356647763
Test accuracy of best model:  0.8833961223927226
In [107]:
f1_train=[]
for clf in clfs:
    pred_train3=clf.predict(X_train)
    values_train=metrics.f1_score(y_train,pred_train3)
    f1_train.append(values_train)
In [108]:
f1_test=[]
for clf in clfs:
    pred_test3=clf.predict(X_test)
    values_test=metrics.f1_score(y_test,pred_test3)
    f1_test.append(values_test)
In [109]:
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("f1score")
ax.set_title("f1 vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [110]:
# selecting the model with the highest test F1 score
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0001350881814381269, random_state=1)

Confusion Matrix - post-pruned decision tree

In [111]:
# Evaluating training model performance
decisiontree_train_perf_postprune = score(best_model,X_train,y_train,'Training')
decisiontree_train_perf_postprune
The precision score is 0.858
The recall score is 0.822
The accuracy score is 0.897
The F1 score is 0.840
Results for the Training model are: 

Out[111]:
Recall Precision Accuracy F1_score
0 0.822193 0.857784 0.896542 0.839612
In [112]:
# Evaluating testing model performance
decisiontree_testing_perf_postprune = score(best_model,X_test,y_test,'Testing')
decisiontree_testing_perf_postprune
The precision score is 0.834
The recall score is 0.796
The accuracy score is 0.883
The F1 score is 0.815
Results for the Testing model are: 

Out[112]:
Recall Precision Accuracy F1_score
0 0.796422 0.833829 0.882753 0.814696
In [113]:
plt.figure(figsize=(17,15))

tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [114]:
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature; also known as the Gini importance)

print (pd.DataFrame(best_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.403941
avg_price_per_room                    0.157766
market_segment_type_Online            0.142774
no_of_special_requests                0.100915
arrival_month                         0.055791
no_of_weekend_nights                  0.031373
arrival_date                          0.029059
no_of_adults                          0.026057
no_of_week_nights                     0.017415
arrival_year                          0.013908
required_car_parking_space            0.010165
type_of_meal_plan_Meal Plan 1         0.003322
room_type_reserved_Room_Type 4        0.001884
market_segment_type_Offline           0.001710
room_type_reserved_Room_Type 1        0.001345
room_type_reserved_Room_Type 5        0.001083
room_type_reserved_Room_Type 2        0.000616
no_of_children                        0.000478
type_of_meal_plan_Not Selected        0.000399
type_of_meal_plan_Meal Plan 3         0.000000
repeated_guest                        0.000000
room_type_reserved_Room_Type 3        0.000000
no_of_previous_bookings_not_canceled  0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 7        0.000000
market_segment_type_Aviation          0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Corporate         0.000000
no_of_previous_cancellations          0.000000
type_of_meal_plan_Meal Plan 2         0.000000
In [115]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In [116]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decisiontree_train_perf.T,
        decisiontree_train_perf_prune.T,
        decisiontree_train_perf_postprune.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[116]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Recall 0.986608 0.785962 0.822193
Precision 0.995776 0.724058 0.857784
Accuracy 0.994211 0.830852 0.896542
F1_score 0.991171 0.753741 0.839612
In [117]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decisiontree_test_perf.T,
        decisiontree_test_perf_prune.T,
        decisiontree_testing_perf_postprune.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[117]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Recall 0.804373 0.783078 0.796422
Precision 0.797130 0.727513 0.833829
Accuracy 0.870440 0.834880 0.882753
F1_score 0.800735 0.754273 0.814696

Model Performance Comparison and Conclusions¶

Logistic Regression:

This model can be used to predict a customer's chance of cancelling a reservation with an F1 score of 69.22% at a threshold of 0.42, with recall and precision of about 69.43% and 69.01% respectively.
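Applying the chosen 0.42 cutoff amounts to thresholding the predicted probability of the positive class rather than calling predict(), which uses an implicit 0.5 cutoff. A minimal sketch with a stand-in model on synthetic data (in the notebook this would be the fitted lg3 and X_test3):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model on synthetic data, for illustration only
X, y = make_classification(n_samples=500, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]            # P(booking is cancelled)
custom_pred = (proba >= 0.42).astype(int)       # threshold chosen from the PR curve
default_pred = model.predict(X)                 # implicit 0.5 threshold

# Lowering the threshold can only turn 0s into 1s, so it trades precision
# for recall: more bookings get flagged as likely cancellations.
print(custom_pred.sum(), default_pred.sum())
```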

Decision Tree Model:

The post-pruned decision tree can be used to predict a customer's chance of cancelling a reservation with an F1 score of 81.47%, with precision and recall of about 83.38% and 79.64% respectively.

We can see that lead_time, avg_price_per_room, market_segment_type_Online and no_of_special_requests are very important features, with importance scores of 0.40, 0.16, 0.14 and 0.10 respectively.

Observation and Data summary¶

The INN Hotels dataset has 36275 rows and 19 columns. There were no missing values or duplicates in the data.

1) Almost 72% of reservations have 2 adults, followed by 21.21% with 1 adult and then 3 adults. A few reservations have 0 adults or 4 adults.

2) Almost 92.56% of reservations have 0 children, followed by 4.4% with 1 child. The maximum number of children in a reservation is 10.

3) Almost 76.73% of reservations selected Meal Plan 1, while 14.14% did not select any meal plan. Almost 9.11% selected Meal Plan 2 and only 0.01% selected Meal Plan 3.

4) Almost 96.9% of reservations do not require a car parking space; only 3.1% show a need for one.

5) Almost 77.55% of reservations are for Room_Type 1, followed by 16.7% for Room_Type 4. The remaining 5 room types together add up to less than 6%.

6) More than 5000 guests did not make their reservation in advance. The average lead time is 85.23 days and the median is 57 days; some reservations have a lead time of more than 400 days. The distribution is right-skewed.

7) Almost 82% of the data is from 2018 and only 17.95% from 2017. Most people like to make reservations in the 6th through 10th months of the year; there are very few reservations in the first quarter.

8) Only 2.57% of reservations were made by repeated guests; most reservations are from new guests.

9) The average price per reservation is 103.42 dollars and the median is 99.45 dollars; the distribution is roughly normal.

10) Almost 54.5% of reservations made 0 special requests, followed by 31.35% with 1 request. Some reservations have 4 or 5 requests.

11) Almost 64% of reservations are from the online market segment, followed by 29% from the offline segment. The corporate segment accounts for about 5.56%.

12) Almost 67.24% of reservations were not cancelled, while 32.76% were cancelled by the customer.

Actionable Insights and Recommendations¶

1) We achieved the highest F1 score, 81.47%, with the post-pruned decision tree compared with logistic regression, along with 79.64% recall and 83.38% precision. All the logistic regression models and the decision trees gave a generalized performance on the training and test sets.

2) Bookings made well in advance have a lower average price per room, but more cancellations happen when the lead time is high. If a reservation is made more than 150 days in advance it has about a 72% chance of cancellation, compared with only 23% when made less than 150 days in advance. INN Hotels should call customers who book far in advance to confirm they still want to keep the reservation, and should set strict policies for last-minute cancellations and inform customers about them at booking time.
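The lead-time split above can be computed with a pandas groupby on a bucketed lead time. A minimal sketch on a toy frame (the column names match the notebook's innhotel_df, where booking_status is 1 for a cancelled booking, but the numbers below are illustrative only):

```python
import pandas as pd

# Toy frame standing in for innhotel_df; booking_status: 1 = cancelled
df = pd.DataFrame({
    "lead_time":      [10, 200, 45, 300, 7, 180, 90, 250],
    "booking_status": [0,   1,  0,   1, 0,   1,  0,   0],
})

# Bucket bookings by whether they were made more than 150 days ahead,
# then take the mean of the 0/1 cancellation flag per bucket
df["more_than_150_days"] = df["lead_time"] > 150
rate = df.groupby("more_than_150_days")["booking_status"].mean()
print(rate)
```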

3) The cancellation rate is high for the online market segment (about 36%) and very low for the corporate segment (about 10%). It is almost 30% for the aviation segment, and the complementary segment has no cancellations. INN Hotels should pursue more corporate reservations by contacting nearby corporate offices, followed by the airline market segment, to avoid revenue loss; tie-ups with corporates and airlines would help.

4) When customers make special requests, the chance of cancellation goes down. Customers without special requests have a cancellation rate of 43.2%; with just 1 request it drops to about 23.7%, and with 3, 4 or 5 requests it is 0%. The hotel should take proper care of customer requests and ask about them during booking; this improves INN Hotels' ratings and in turn grows the customer base.

5) Only 2.57% of reservations were made by repeated guests; most are from new guests. Repeated guests have a 98.28% chance of not cancelling and only a 1.72% chance of cancelling. INN Hotels should take good care of all guests and ask for feedback at checkout to encourage return visits.

6) The chance of cancellation is about 24% when booked by only 1 guest, almost 35% for 2 or 3 guests, and almost 44% for 4 guests. The hotel should follow up with larger parties about any special requests and offer some complimentary items.

7) Reservations cost less when made for the early quarter of the year; prices go up during summer (months 5 to 9) and then start going down in the last 2 months. Rates for 2018 are quite a bit higher than for 2017. INN Hotels should make sure they have enough staff during months 5 to 9 to provide good customer service; during the first quarter and the last 2 months of the year they need less staff, as the number of reservations is lower.

8) booking_status is most highly correlated with lead_time, with a factor of 0.44. It also has positive correlations of 0.14 with avg_price_per_room and 0.18 with arrival_year. avg_price_per_room is positively correlated with no_of_adults and no_of_children, with factors of 0.30 and 0.34 respectively.
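Pairwise correlations like these can be read straight off DataFrame.corr(). A minimal sketch on a toy frame (column names match the notebook's innhotel_df, but the data, and hence the printed value, are illustrative only):

```python
import pandas as pd

# Toy numeric frame, for illustration; the same kind of call over the
# numeric columns of the real data would yield correlations like those above
df = pd.DataFrame({
    "lead_time":      [5, 30, 60, 120, 200, 300],
    "booking_status": [0,  0,  0,   1,   1,   1],
})

# Pearson correlation between each pair of numeric columns
corr = df.corr()
print(round(corr.loc["booking_status", "lead_time"], 2))
```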